Dna Capitals Hackathon


Summary

The aim of this hackathon is to extract specific information about companies. The extraction is developed against a training sample of company names and, at the end of the hackathon, validated against a test set of companies.

Project Links

Description

The project is divided into two parts:
  • Company Info Crawler - Ruby on Rails application that extracts generic information from the target website.
  • Web Scraper - Python script that uses BeautifulSoup to extract specific information from the target website.

Web Scraper

BeautifulSoup is used to extract specific information about companies from openly accessible websites (Wikipedia, PitchBook and Bloomberg). The company website is identified by collecting candidate websites from various search engines using different combinations of keywords and then choosing the most relevant one. The information scraped is:
  • Sector to which the company belongs
  • About the company
  • Contact details
  • Investors of the company
  • News articles about the company
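The extraction step can be illustrated with a minimal sketch. It assumes a hypothetical, simplified HTML page held in a string; the real pages on Wikipedia, PitchBook and Bloomberg use different markup, and the function name `extract_company_info`, the CSS classes and the field names are illustrative assumptions, not the project's actual code.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched company page.
# Real source pages have their own structure and need their own selectors.
SAMPLE_HTML = """
<html><body>
  <table class="infobox">
    <tr><th>Industry</th><td>Renewable energy</td></tr>
    <tr><th>Website</th><td><a href="https://example.com">example.com</a></td></tr>
  </table>
  <p id="about">Example Corp builds solar panels.</p>
  <ul class="investors"><li>Fund A</li><li>Fund B</li></ul>
</body></html>
"""


def extract_company_info(html):
    """Pull a few company fields out of an HTML document (illustrative only)."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}

    # Each infobox row becomes a lower-cased key with its cell text as value.
    for row in soup.select("table.infobox tr"):
        header, value = row.find("th"), row.find("td")
        if header and value:
            info[header.get_text(strip=True).lower()] = value.get_text(strip=True)

    # The "about" paragraph may be missing, so fall back to None.
    about = soup.find("p", id="about")
    info["about"] = about.get_text(strip=True) if about else None

    # Investors are collected as a plain list of names.
    info["investors"] = [li.get_text(strip=True) for li in soup.select("ul.investors li")]
    return info


company = extract_company_info(SAMPLE_HTML)
```

Returning a plain dictionary keeps the result easy to merge across sources; in practice each source site would probably need its own set of selectors.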

Company Info Crawler

The website obtained from the Web Scraper is used to extract generic information about the company. The content is fetched from the website using HTTParty, parsed using Nokogiri and indexed using Elasticsearch. This process runs as a background task using Sidekiq. The most informative content for a query is returned using context similarity (LSI, Carrot2 clustering). Latent Dirichlet allocation (LDA) is used to extract topics from the indexed content based on the query.
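The idea behind LSI-based context similarity can be sketched as follows. This is a small standalone illustration in Python with NumPy, not the project's Rails and Elasticsearch implementation; the function names, the toy documents and the choice of raw term counts with `k=2` latent dimensions are all assumptions made for the example.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return a (terms x documents) count matrix and the term-to-row index."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[index[word], j] += 1.0
    return matrix, index


def lsi_rank(docs, query, k=2):
    """Rank documents by cosine similarity to the query in a k-dim latent space."""
    matrix, index = build_term_document_matrix(docs)

    # Truncated SVD: keep only the k strongest latent dimensions.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

    # Document coordinates in the latent space (equal to U_k^T times each column).
    doc_vecs = (s_k[:, None] * vt_k).T

    # Project the query's term counts into the same space; unknown words are ignored.
    q = np.zeros(len(index))
    for word in query.lower().split():
        if word in index:
            q[index[word]] += 1.0
    q_vec = u_k.T @ q

    # Cosine similarity, leaving the score at 0 where a norm is zero.
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    scores = np.divide(doc_vecs @ q_vec, denom, out=np.zeros(len(docs)), where=denom > 0)
    return sorted(enumerate(scores.tolist()), key=lambda pair: pair[1], reverse=True)


# Toy corpus (hypothetical): two energy documents and two banking documents.
docs = [
    "solar energy panels",
    "solar energy storage",
    "bank loans credit",
    "bank loans mortgage",
]
ranked = lsi_rank(docs, "solar panels")
```

With this toy corpus the two energy documents should rank first with scores close to 1, while the banking documents score close to 0. Note that "solar energy storage" ranks alongside the exact match even though it does not contain "panels", which is the kind of context matching LSI is meant to provide.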
keyword search

Metadata of a Company

keyword search

Query Results

Awards

Won first prize at the hackathon.